features. Finally, a Support Vector Machine (SVM) is trained to produce a classifier that distinguishes
whether the feature vector belongs to a fingertip

1. INTRODUCTION

Medical applications, including hand rehabilitation for stroke survivors, have benefited from
advances in technology for many years. Computer vision has also been exploited in this field and has
been the subject of many research works. Although computer vision technology has advanced rapidly over
the years, vision-based fingertip detection still faces several difficult challenges: (1) the hand is
non-rigid and has a high degree of freedom, which makes it difficult to match the various shapes of
fingers against a set of images; (2) fingers vary widely in orientation and appearance, which makes it
difficult to detect their shape and posture accurately and robustly; and (3) slight differences may
lead to substantial errors, even among fingertips that belong to the same person [1]. These challenges
become even more significant when commercial vision systems are used instead of industrial-grade ones.

In this paper, a machine-learning-based solution is proposed for use in hand rehabilitation. One
widely practiced rehabilitation exercise asks the patient to squeeze a flexible exercise ball in
his/her hand repetitively [2]. The balls come in various levels of resistance to accommodate the
various levels of limitation of the patients' hands. However, one of the challenges is to measure
objectively or quantitatively the progress that has been made, if any. A machine-vision-based system
may offer a non-intrusive way of measuring fingertip position. Some existing rehabilitation approaches
assisted by machine-vision-based systems involve interaction between the human and a virtual world.
Detection and tracking of the fingertips are essential for contactless fingertip position measurement.

Other researchers have worked on fingertip detection using machine vision. Examples include a
real-time fingertip detection engine targeted at mobile devices for Natural User Interfaces (NUIs) [3];
a system capable of detecting fingertips reliably in complex environments under different light
conditions and in different scenes without any markers [4]; the use of a Kinect sensor for fingertip
detection in a writing-in-the-air character recognition system by Feng et al. (2012); and an approach
that allows the detection of the hand and fingertips with or without illumination against a cluttered
background [5]. It should be noted that not all hand gesture recognition requires determining the
position of the fingertips; some methods rely only on the overall shape of the hand [6].

This paper is organized as follows. The related work on fingertip detection is reviewed in Section
2. In Section 3, the proposed algorithm for fingertip detection and the experimental results are
presented and discussed. Finally, the summary of the work is presented in the concluding section.


2. FINGERTIP DETECTION ALGORITHM 
2.1. Bag of Words 

The bag of words (BoW) model has been used in machine vision for around a decade. The model was
originally applied in natural language analysis, where a text document is represented as a histogram
of words without considering the grammar or the order or location of the words in the text. The model
builds a dictionary consisting of the vocabulary of words found in the texts that are fed into it as
input. In machine vision, the model has been popular due to its simplicity and effectiveness [7], and
it is also widely known as bag of visual words or bag of features. The same researchers stated that
BoW traditionally employs scale-invariant feature transform (SIFT) descriptors, which reduce the
dimensionality of the feature space.

To build the dictionary of visual words, also known as the codebook, the technique extracts the
visual words from the training images, as illustrated by the flowchart in Figure 1. During the
learning stage, a large set of images of different classes is used. From each image, keypoints are
first extracted. Subsequently, for each keypoint, a feature descriptor is established that represents
the features of the neighborhood of the keypoint. In the next step, for dimension reduction purposes,
these descriptors are clustered into groups, which are called visual words. All the visual words
generated from the training images are collected as the codebook, which is equivalent to a dictionary
containing the vocabulary of words.

Figure 1. Extraction of Features and Generation of the Codebook


During the image recognition stage, keypoints are extracted, feature descriptors are defined, and
the descriptors are clustered to generate the bag of words for the image, which is basically a
histogram of the visual words that are present in the image, such as the one shown in Figure 2.

Figure 2. Histogram of Visual Word Occurrences
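
The encoding step described above can be sketched as follows. This is a minimal illustration, not the implementation used in this work: it assumes the codebook is already available as an array of cluster centers, assigns each descriptor to its nearest visual word, and normalizes the resulting histogram. The function name encode_bow and the toy data are hypothetical.

```python
import numpy as np

def encode_bow(descriptors, codebook):
    """Return a normalized histogram of visual-word occurrences."""
    # Squared Euclidean distance from every descriptor to every visual word.
    diffs = descriptors[:, None, :] - codebook[None, :, :]
    dists = np.sum(diffs ** 2, axis=2)
    # Each descriptor votes for its nearest visual word.
    words = np.argmin(dists, axis=1)
    hist = np.bincount(words, minlength=len(codebook)).astype(float)
    # Normalize so that images with different keypoint counts are comparable.
    return hist / max(hist.sum(), 1.0)

# Toy data (assumed): 3 visual words and 4 descriptors in a 2-D feature space.
codebook = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]])
descriptors = np.array([[0.1, 0.0], [0.9, 1.2], [1.1, 0.8], [4.8, 5.1]])
histogram = encode_bow(descriptors, codebook)
```

The histogram has one bin per visual word, so its length is fixed by the codebook size regardless of how many keypoints an image yields.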


2.2. Speeded Up Robust Features (SURF)

SURF was introduced by Bay et al. [8] and has proven to be effective and popular, thanks to its
repeatability, distinctiveness and relatively fast speed. Comparative works, such as [9], show that
although SURF identifies fewer features and achieves slightly fewer correct matches than its
predecessor, the Scale-Invariant Feature Transform (SIFT), it achieves a higher number of correct
matches per given time [10]. Both of these methods are scale invariant and implementable in real-time
systems [11]. SURF consists of four stages: integral image generation, approximated Hessian detector,
descriptor orientation assignment and descriptor generation [12]. To achieve high speed, the detection
uses integral images, popularized by Viola and Jones [13], which reduce the number of mathematical
operations. The integral image I_Σ at a point X = (x, y) is defined mathematically as follows:

\[ I_{\Sigma}(X) = \sum_{i=0}^{i \le x} \sum_{j=0}^{j \le y} I(i, j) \]
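
The saving in operations can be illustrated with a short sketch. It assumes a grayscale image stored as a NumPy array; the helper names integral_image and box_sum are illustrative only. Once the integral image is built, the sum over any upright rectangle needs at most four lookups, whatever the size of the rectangle.

```python
import numpy as np

def integral_image(image):
    """Entry (y, x) holds the sum of image[:y + 1, :x + 1]."""
    return image.cumsum(axis=0).cumsum(axis=1)

def box_sum(ii, top, left, bottom, right):
    """Sum of image[top:bottom + 1, left:right + 1] from at most four lookups."""
    total = ii[bottom, right]
    if top > 0:
        total -= ii[top - 1, right]
    if left > 0:
        total -= ii[bottom, left - 1]
    if top > 0 and left > 0:
        total += ii[top - 1, left - 1]
    return total

# Toy 4 x 4 image (assumed data) with pixel values 0..15.
image = np.arange(16, dtype=np.int64).reshape(4, 4)
ii = integral_image(image)
region = box_sum(ii, 1, 1, 2, 2)  # sum of the central 2 x 2 block
```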

The Hessian matrix is used on the integral image for the localization and scaling of interest
points; it particularly looks for blob-like structures, which are found where the determinant of the
matrix is high. The Hessian matrix H(X, σ) at a point X of an image at scale σ is defined as follows:

\[ H(X, \sigma) = \begin{bmatrix} L_{xx}(X, \sigma) & L_{xy}(X, \sigma) \\ L_{yx}(X, \sigma) & L_{yy}(X, \sigma) \end{bmatrix} \]

where Lxx(X, σ) is the convolution of the Gaussian second order derivative ∂²g(σ)/∂x² with the
image I at point X, and similarly for Lxy(X, σ), Lyx(X, σ) and Lyy(X, σ).
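
The blob response can be sketched numerically as below. Note the assumptions: this sketch convolves with exact, truncated Gaussian derivative kernels rather than the box-filter approximation on the integral image that SURF uses for speed, and all function names are illustrative. Because Lxy equals Lyx for Gaussian derivatives, the determinant reduces to Lxx·Lyy − Lxy².

```python
import numpy as np

def gaussian_kernels(sigma):
    """1-D Gaussian and its first and second derivatives, truncated at 3 sigma."""
    radius = int(np.ceil(3 * sigma))
    x = np.arange(-radius, radius + 1, dtype=float)
    g = np.exp(-x**2 / (2 * sigma**2)) / (np.sqrt(2 * np.pi) * sigma)
    g1 = -x / sigma**2 * g
    g2 = (x**2 / sigma**2 - 1) / sigma**2 * g
    return g, g1, g2

def filter_separable(image, kernel_x, kernel_y):
    """Convolve every row with kernel_x, then every column with kernel_y."""
    rows = np.apply_along_axis(np.convolve, 1, image, kernel_x, mode="same")
    return np.apply_along_axis(np.convolve, 0, rows, kernel_y, mode="same")

def hessian_determinant(image, sigma):
    """det H = Lxx * Lyy - Lxy^2 at every pixel for one scale sigma."""
    g, g1, g2 = gaussian_kernels(sigma)
    lxx = filter_separable(image, g2, g)
    lyy = filter_separable(image, g, g2)
    lxy = filter_separable(image, g1, g1)
    return lxx * lyy - lxy**2

# Synthetic test image (assumed data): one bright blob centered at (20, 20).
yy, xx = np.mgrid[0:41, 0:41]
blob = np.exp(-((xx - 20) ** 2 + (yy - 20) ** 2) / (2 * 3.0**2))
det = hessian_determinant(blob, sigma=3.0)
peak = np.unravel_index(np.argmax(det), det.shape)
```

For this synthetic blob the determinant peaks at the blob center, which is the behaviour the detector relies on when it selects interest points.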

Following interest point detection, SURF builds a descriptor around each interest point, which
includes the dominant orientation. The region around each interest point is split into sub-regions.
For each sub-region, a vector is defined by using Haar wavelet responses. These vectors form the
descriptor.
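
The sub-region construction can be illustrated with a simplified sketch. It makes several assumptions that go beyond the text: a square 20 x 20 patch, a 4 x 4 grid of sub-regions, plain finite differences standing in for the Haar wavelet responses, and no orientation alignment or Gaussian weighting. Each sub-region contributes the sums of the horizontal and vertical responses and of their absolute values, which gives a 64-element vector. The function name is hypothetical.

```python
import numpy as np

def surf_like_descriptor(patch, grid=4):
    """Build a 4 * grid * grid vector from per-sub-region response sums."""
    patch = patch.astype(float)
    # Finite differences stand in for the Haar wavelet responses (assumption).
    dx = np.zeros_like(patch)
    dy = np.zeros_like(patch)
    dx[:, 1:] = patch[:, 1:] - patch[:, :-1]
    dy[1:, :] = patch[1:, :] - patch[:-1, :]
    step = patch.shape[0] // grid  # side length of one sub-region
    features = []
    for r in range(grid):
        for c in range(grid):
            window = (slice(r * step, (r + 1) * step),
                      slice(c * step, (c + 1) * step))
            sx, sy = dx[window], dy[window]
            features += [sx.sum(), sy.sum(), np.abs(sx).sum(), np.abs(sy).sum()]
    vec = np.array(features)
    norm = np.linalg.norm(vec)
    # Unit length makes the vector less sensitive to contrast changes.
    return vec / norm if norm > 0 else vec

# Random patch used only as placeholder input.
rng = np.random.default_rng(0)
patch = rng.random((20, 20))
descriptor = surf_like_descriptor(patch)
```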


2.3. K-Means Clustering 

K-means clustering is one of the methods used for image segmentation, which is the classification
of an image into distinct groups. Before applying this unsupervised learning technique, an initial
enhancement is applied to improve the image. A subtractive clustering method generates centroids based
on the potential value of the data points. In other words, subtractive clustering is used to generate
the initial centers that are used by the K-means algorithm for the data points [14].

In this work, K-means clustering is used to associate each generated descriptor with the right
cluster, which is also known as a visual word in the bag-of-words technique. With this clustering, the
classification stage, which is the next step, deals with data of lower dimension, which in turn helps
in gaining a higher processing speed.
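
A minimal sketch of the clustering step is given below. It assumes the standard Lloyd iteration with randomly chosen initial centers, not the subtractive-clustering initialization of [14], and the toy points stand in for real descriptors. The returned centers play the role of the visual words.

```python
import numpy as np

def kmeans(points, k, iterations=20, seed=0):
    """Plain K-means (Lloyd's algorithm) with random initial centers."""
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), size=k, replace=False)].copy()
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest center.
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = np.argmin(dists, axis=1)
        # Update step: each center moves to the mean of its members.
        for j in range(k):
            members = points[labels == j]
            if len(members) > 0:
                centers[j] = members.mean(axis=0)
    # Final assignment against the converged centers.
    dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
    return centers, np.argmin(dists, axis=1)

# Toy descriptors (assumed data): two well-separated groups in 2-D.
points = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0],
                   [10.0, 10.0], [10.0, 11.0], [11.0, 10.0], [11.0, 11.0]])
centers, labels = kmeans(points, k=2)
```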


2.4. Support Vector Machine (SVM) 

SVM is a supervised learning method that is used for regression and classification [15]. It carries
out classification by constructing a hyperplane in a multi-dimensional space that optimally divides
the data into two groups. The SVM classifier model is closely associated with neural networks: the
model uses a sigmoid kernel function, which makes it similar to a two-layer perceptron neural network.
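
The link to a two-layer perceptron can be seen in the form of the decision function. The sketch below is illustrative only: the support vectors, coefficients, and the kernel parameters gamma and coef0 are chosen by hand rather than learned, and the function names are hypothetical. The tanh terms act like hidden units, and the weighted sum acts like the output layer.

```python
import numpy as np

def sigmoid_kernel(a, b, gamma=0.5, coef0=0.0):
    """K(a, b) = tanh(gamma * <a, b> + coef0) for every pair of rows."""
    return np.tanh(gamma * a @ b.T + coef0)

def decision_function(x, support_vectors, dual_coef, bias, gamma=0.5, coef0=0.0):
    """Signed score: weighted sum of kernel values plus a bias."""
    k = sigmoid_kernel(x, support_vectors, gamma, coef0)
    return k @ dual_coef + bias

# Hand-picked values for illustration (not learned from data).
support_vectors = np.array([[1.0, 1.0], [-1.0, -1.0]])
dual_coef = np.array([1.0, -1.0])  # alpha_i * y_i for each support vector
samples = np.array([[2.0, 2.0], [-2.0, -1.0]])
scores = decision_function(samples, support_vectors, dual_coef, bias=0.0)
labels = np.where(scores >= 0, 1, -1)
```

In practice the support vectors and coefficients come from solving the SVM training problem; only the evaluation of a trained model is shown here.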


3. EXPERIMENT AND DATA GATHERING 
3.1. Experimental Setup 

In this work, a commercial high-definition (HD) Logitech C615 webcam with a resolution of 1920 x
1080 pixels has been used. Figure 3 shows an example of an image captured by the webcam, in which a
hand is holding a therapy ball. The images are captured while the webcam is facing upwards, towards a
light-emitting source in the ceiling. The glare from the light source contributes to the variation of
intensity in each captured image.

The setup for the image data gathering is illustrated in Figure 4. The blue circles denote the
positions of the hands, where the distance between adjacent blue circles is approximately 10 cm. The
distance Yhand denotes the perpendicular distance from the position of the hands to the webcam. The
webcam captured two images of the hand at each position.

Figure 3. Example of a Captured Image of a Therapy-Ball-Holding Hand

Figure 4. Experimental Setup (Blue Circles Denote the Positions of the Hands in the Experiment)


3.2. Image Data Gathering 

For image data gathering, a few sets of images were captured. Different hand sizes, skin colors,
and orientations from 10 different individuals (5 male and 5 female) were included in the captured
image data. Examples of hands of different orientations are shown in Figure 5.





Figure 5. Images of Hand of Different Orientations 


Then, the images of the fingertips and non-fingertips were cropped from the hand-holding-ball
images. The size of the cropped images, for both fingertip and non-fingertip images, is 50 x 50
pixels. Non-fingertip images are images that do not contain any fingertip; instead they contain the
background, the ball, the wrist, etc. All the cropped images are stored in two separate folders, one
for fingertip images and the other for non-fingertip ones. Examples from both groups of images are
shown in Figure 6. A total of 4200 images were obtained, to be used for both classification training
and validation.

(a) (b)
Figure 6. Examples of the Images: (a) Fingertip and (b) Non-fingertip


3.3. Detection Validation Testing 

Using the gathered image data, the classifier was trained and the detection success rate was then
evaluated. Figure 7 shows step by step how the experiment and evaluation were done.

An array of image sets is constructed based on two main categories: fingertip and non-fingertip.
The number of images per category, as well as the category labels, was inspected. If the number of
images per category is unequal, it can be adjusted so that there is an equal number of images per
category. The sets are then separated into training and validation sets. The splitting was randomized
to prevent the results from being biased.

Figure 7. Category Classification Training
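
The balancing and randomized splitting can be sketched as follows. This is an illustration under assumptions: the file names, the training fraction, and the function name are hypothetical, and the larger category is simply truncated after shuffling so that both categories end up with the same number of images.

```python
import numpy as np

def balance_and_split(fingertip, non_fingertip, train_fraction=0.5, seed=0):
    """Equalize category sizes, then split each category at random."""
    rng = np.random.default_rng(seed)
    n = min(len(fingertip), len(non_fingertip))  # size of the smaller category
    cut = int(round(train_fraction * n))
    splits = {}
    for name, items in (("fingertip", fingertip), ("non_fingertip", non_fingertip)):
        # Random order, truncated to n items, so both categories are equal.
        order = rng.permutation(len(items))[:n]
        chosen = [items[i] for i in order]
        splits[name] = {"train": chosen[:cut], "validation": chosen[cut:]}
    return splits

# Hypothetical file lists of unequal length.
fingertip_files = [f"fingertip_{i:04d}.png" for i in range(12)]
non_fingertip_files = [f"non_fingertip_{i:04d}.png" for i in range(9)]
splits = balance_and_split(fingertip_files, non_fingertip_files, train_fraction=2 / 3)
```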


The bag of words technique is adapted to computer vision from natural language processing.
Images do not contain discrete words; therefore, SURF features from each image category must be
collected into a visual ‘vocabulary’. The visual vocabulary is constructed by reducing the number of
features through quantization of the feature space using K-means clustering. Furthermore, the visual
word occurrences in an image are counted by constructing a histogram, which gives a reduced
representation of the image, as shown in Figure 2. The encoded training images from both categories
are fed into a classifier training process.

During the evaluation of the classifier's performance, the training set was tested first and a
near-perfect confusion matrix was produced. The classifier evaluation step was also performed with the
validation set, which was not used during the training. The confusion matrix produced is a good
indicator of how well the classifier is performing.
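
How a confusion matrix yields per-category success rates can be shown with a small sketch. The labels below are made up for illustration and are not results from this work; category 0 stands for fingertip and category 1 for non-fingertip.

```python
import numpy as np

def confusion_matrix(true_labels, predicted_labels, n_classes=2):
    """Rows are true categories, columns are predicted categories."""
    matrix = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(true_labels, predicted_labels):
        matrix[t, p] += 1
    return matrix

def per_class_success_rate(matrix):
    """Percentage of each true category that was classified correctly."""
    return 100.0 * np.diag(matrix) / matrix.sum(axis=1)

# Made-up labels: 0 = fingertip, 1 = non-fingertip.
true_labels = [0, 0, 0, 0, 1, 1, 1, 1]
predicted = [0, 0, 0, 1, 1, 1, 0, 0]
cm = confusion_matrix(true_labels, predicted)
rates = per_class_success_rate(cm)
```

A near-perfect matrix has almost all of its counts on the diagonal, so both rates approach 100%.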


4. EXPERIMENTAL RESULTS AND ANALYSIS

In this section, we assess the success rate of the detection algorithm. In the experiment, a total
of 4200 images were used. The image data set consists of two main subsets, fingertip and non-fingertip
images, each image with a resolution of 50 x 50 pixels. The images are divided into three sets:
training, validation, and unused. The splitting of the data sets was randomized to avoid biasing the
results.

Table 1 shows the averaged success rate for the detection of fingertip and non-fingertip on the
validation set when the number of training images varies from 100 to 2000. Figure 8 shows the
graphical representation of the data in Table 1. The highest success rate is 95.6% for the fingertip
(at 1000 training images) and 92.4% for the non-fingertip (at 2000 training images), which is
acceptably high. The trend also shows that a higher success rate can generally be obtained if the
number of training images is increased, especially for non-fingertip detection, which rises from 83%
to 92.4%.


Table 1. Averaged Success Rate from Validation Set

No. of training    Averaged success rate (%)
images             Fingertip    Non-fingertip
100                93.8         83
500                94.2         87.4
1000               95.6         92
2000               94.4         92.4

Figure 8. Graph of No. of Training Images vs Averaged Success Rate of the Detection of Fingertip and
Non-fingertip using Validation Set


A histogram of visual word occurrences was generated during classification training, as shown in
Figure 2. The histogram forms a basis for training a classifier and for the actual image
classification. In other words, it encodes an image into a feature vector. The encoded training images
in each category are fed into the classifier training. In the recognition stage, the image is
represented by the visual words, which are then distinguished by the classifier.

Figure 9 shows typical results of the detection algorithm when it is applied by scanning over a
full image. The green detection boxes mark the parts of the image where fingertips are detected. The
results show the improvement achieved when a higher number of training images is used. The outputs
show a promising result in the detection.

(a) (b) (c)
Figure 9. Results of the Detection Algorithm: (a) No. of Training Images = 100, (b) No. of
Training Images = 1000, (c) No. of Training Images = 2000
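
The scanning over a full image can be sketched as a sliding window. The 50 x 50 window matches the size of the cropped training images; the stride of 25 pixels, the function names, and the brightness-based stand-in classifier are assumptions made only for illustration. In the actual pipeline the trained classifier would decide whether each patch is a fingertip.

```python
import numpy as np

def sliding_windows(image_shape, size=50, stride=25):
    """Yield the top-left corner of every window that fits in the image."""
    height, width = image_shape[:2]
    for top in range(0, height - size + 1, stride):
        for left in range(0, width - size + 1, stride):
            yield top, left

def detect(image, classify_patch, size=50, stride=25):
    """Return (top, left, height, width) boxes for patches classified positive."""
    boxes = []
    for top, left in sliding_windows(image.shape, size, stride):
        patch = image[top:top + size, left:left + size]
        if classify_patch(patch):
            boxes.append((top, left, size, size))
    return boxes

# Synthetic image (assumed data) with one bright 50 x 50 square.
image = np.zeros((100, 150))
image[50:100, 25:75] = 1.0
# Stand-in for the trained classifier: accept patches that are almost all bright.
boxes = detect(image, lambda patch: patch.mean() > 0.9)
```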


5. CONCLUSIONS

In this work, it has been shown that the method based on SURF and bag of words performs well in
detecting fingertips in images where a hand is holding a therapy ball of the kind normally used in
post-stroke hand therapy. The success rate was generally found to increase when the number of
training images was increased, especially in the correct identification of the non-fingertip, i.e.
lower false positive detection rates. The success rate for fingertip detection reached more than 94%
with the algorithm, which is reasonably high for therapy applications, despite the use of a
commercial-grade camera.